Multilingual Word Segmentation and Part-of-Speech Tagging: A Machine Learning Approach Incorporating Diverse Features ∗
Abstract
The aim of this dissertation is to study statistical methods for high-accuracy multilingual word segmentation and part-of-speech (POS) tagging. Word segmentation and POS tagging are fundamental language analysis tasks in natural language processing and are used in many applications. Unknown words are a major problem in these tasks and need to be handled properly. We attempt to develop methods for word segmentation and POS tagging that can use informative features effectively. First, we study a method for unknown word guessing and POS tagging using support vector machines (SVMs), which can handle a number of features effectively. We apply the method to English unknown word guessing and POS tagging. Second, we propose a method for POS guessing of unknown words that uses global information as well as local information. Global features often give useful information for POS guessing, and the method uses Gibbs sampling to take into account interactions between the POS tags of all the unknown words in a document. We apply the method to Chinese, Japanese and English unknown word guessing. Third, we propose a word segmentation method that combines the existing word-based and character-based methods, in order to compensate for the ...
∗ Doctoral Dissertation, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DD0461022, February 2, 2006.
Similar resources
Is Arabic Part of Speech Tagging Feasible Without Word Segmentation?
In this paper, we compare two novel methods for part of speech tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags without any word segmentation; the second approach is segmentation-based, using a machine learning segmenter. Surprisingly, word-based POS tagging yields the best results, ...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipel...
Arabic Part of Speech Tagging
Arabic is a morphologically rich language, which presents a challenge for part of speech tagging. In this paper, we compare two novel methods for POS tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags that describe full words and does not require any word segmentation. The second app...
A Long Dependency Aware Deep Architecture for Joint Chinese Word Segmentation and POS Tagging
Long-term context is crucial to the joint Chinese word segmentation and POS tagging (S&T) task. However, most machine-learning-based methods extract features from a window of characters. Because the window size is limited, these methods cannot exploit long-distance information. In this work, we propose a long dependency aware deep architecture for the joint S&T task. Specifically, to simulate...
Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?
Chinese part-of-speech (POS) tagging assigns one POS tag to each word in a Chinese sentence. However, since words are not demarcated in a Chinese sentence, Chinese POS tagging requires word segmentation as a prerequisite. We could perform Chinese POS tagging strictly after word segmentation (one-at-a-time approach), or perform both word segmentation and POS tagging in a combined, single step si...